Full text available: pdf(411.53 KB)



In recent years, formal methods have emerged as an alternative approach to ensuring the 
quality and correctness of hardware designs, overcoming some of the limitations of 
traditional validation techniques such as simulation and testing. There are two main aspects 
to the application of formal methods in a design process: the formal framework used to 
specify desired properties of a design and the verification techniques and tools used to 
reason about the relationship between a spec ... 

Keywords: case studies, formal methods, formal verification, hardware verification, 
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A minimum energy routing protocol reduces the energy consumption of the nodes in a 
wireless ad hoc network by routing packets on routes that consume the minimum amount of 
energy to get the packets to their destination. This paper identifies the necessary features of 
an on-demand minimum energy routing protocol and suggests mechanisms for their 
implementation. We highlight the importance of efficient caching techniques to store the 
minimum energy route information and propose the use of an ... 
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This paper reports the results of a study of VAX 8800 processor performance using a 
hardware monitor that collects histograms of the processor's micro-PC and memory bus 
status. The monitor keeps a count of all machine cycles executed at each micro-PC location,
as well as counting all occurrences of each bus transaction. It can measure a running
system without interfering with it, and this paper's results are based on measurements of 
live timesharing. Because the 8800 is a microcoded machine ... 

4 A processor for a high-performance personal computer
Butler W. Lampson, Kenneth A. Pier 
Full text available: pdf(1.24 MB)
This paper describes the design goals, micro-architecture, and implementation of the
microprogrammed processor for a compact high performance personal computer. This 
computer supports a range of high level language environments and high bandwidth I/O 
devices. Besides the processor, it has a cache, a memory map, main storage, and an 
instruction fetch unit; these are described in other papers. The processor can be shared 
among 16 microcoded tasks, performing microcode context switches ... 

5 Performance and architectural evaluation of the PSI machine 
Full text available: pdf(980.44 KB)
This paper reports the results of a study of VAX-11/780 processor performance using a
novel hardware monitoring technique. A micro-PC histogram monitor was built for these 
measurements. It keeps a count of the number of microcode cycles executed at each 
microcode location. Measurement experiments were performed on live timesharing 
workloads as well as on synthetic workloads of several types. The histogram counts allow 
the calculation of the frequency of various architectural events, such as ... 

A processor for a high-performance personal computer
Butler W. Lampson, Kenneth A. Pier 

August 1998 25 years of the international symposia on Computer architecture 
(selected papers) 
Full text available: pdf(1.38 MB)
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The design of high-performance multiprocessor systems necessitates a careful analysis of 
the memory system performance of parallel programs. Lacking multiprocessor address 
traces, previous multiprocessor performance studies using analytical models had to make an 
inordinate number of assumptions about the underlying memory reference patterns. We 
previously developed a scheme called ATUM - Address Tracing Using Microcode - to get
reliable operating system and multiprogramming traces on single
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The VAX architecture is a popular ISP architecture that has been implemented in several 
different technologies targeted to a wide range of performance specifications. However, it 
has been argued that the VAX has specific characteristics which preclude a very high 
performance implementation. We have developed a microarchitecture (HPS) which is 
specifically intended for implementing very high performance computing engines. Our model 
of execution is a restriction on fine granularity data flow. ... 
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networks 

Pavan Nuggehalli, Vikram Srinivasan, Carla-Fabiana Chiasserini 

June 2003 Proceedings of the fourth ACM international symposium on Mobile ad hoc 
networking & computing 

Full text available: pdf(244.72 KB) Additional Information: full citation, abstract, references, index terms

In this paper, we address the problem of energy-conscious cache placement in wireless ad 
hoc networks. We consider a network comprising a server with an interface to the wired 
network, and some nodes requiring access to the information stored at the server. In order 
to reduce access latency in such a communication environment, an effective strategy is 
caching the server information at some nodes distributed across the network. Caching, 
however, can considerably impact the system energy expenditu ... 
Full text available: pdf(1.03 MB)
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Increasing the execution power requires a high instruction issue bandwidth, and decreasing 
instruction encoding and applying some code improving techniques cause code expansion. 
Therefore, the instruction memory hierarchy performance has become an important factor of 
the system performance. An instruction placement algorithm has been implemented in the 
IMPACT-I (Illinois Microarchitecture Project using Advanced Compiler Technology - Stage I) 
C compiler to maximize the sequential and spatial ... 

13 Cache Performance in the VAX-11/780
Douglas W. Clark 

February 1983 ACM Transactions on Computer Systems (TOCS), volume 1 issue 1 
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Ravi Nair, Martin E. Hopkins 

May 1997 ACM SIGARCH Computer Architecture News, Proceedings of the 24th
annual international symposium on Computer architecture, volume 25 issue 2
Full text available: pdf(2.01 MB)

Modern processors employ a large amount of hardware to dynamically detect parallelism in 
single-threaded programs and maintain the sequential semantics implied by these 
programs. The complexity of some of this hardware diminishes the gains due to parallelism 
because of longer clock period or increased pipeline latency of the machine. In this paper we 
propose a processor implementation which dynamically schedules groups of instructions 
while executing them on a fast simple engine and caches them f ... 

15 Cache behavior of combinator graph reduction 
Philip J. Koopman, Peter Lee, Daniel P. Siewiorek 

April 1992 ACM Transactions on Programming Languages and Systems (TOPLAS),
Volume 14 Issue 2
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The results of cache-simulation experiments with an abstract machine for reducing 
combinator graphs are presented. The abstract machine, called TIGRE, exhibits reduction 
rates that, for similar kinds of combinator graphs on similar kinds of hardware, compare 
favorably with previously reported techniques. Furthermore, TIGRE maps easily and 
efficiently onto standard computer architectures, particularly those that allow a restricted 
form of self-modifying code. This provides some indication th ... 
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June 1986 Conference proceedings on Object-oriented programming systems, 
languages and applications 

Full text available: pdf(849.10 KB) Additional Information: full citation, abstract, references, citings, index terms

A processor for the Smalltalk-80 programming language is described. This machine is
implemented using a standard bit slice ALU and sequencer, TTL MSI, and NMOS LSI RAMS. 
It executes an instruction set similar to the Smalltalk-80 virtual machine instruction set. The 
data paths of the machine are optimized for rapid Smalltalk-80 execution by the inclusion of 
a context cache, tag checking, and a hardware method cache. Each context is only partly 
initialized when created, and has no memor ... 



18 Performance from architecture: comparing a RISC and a CISC with similar hardware
organization

Dileep Bhandarkar, Douglas W. Clark 




April 1991 ACM SIGARCH Computer Architecture News , Proceedings of the fourth 
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Chingren Lee, Jenq Kuen Lee, Tingting Hwang, Shi-Chun Tsai 

April 2003 ACM Transactions on Design Automation of Electronic Systems (TODAES),
Volume 8 Issue 2
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In this article, we investigate compiler transformation techniques regarding the problem of 
scheduling VLIW instructions aimed at reducing power consumption of VLIW architectures in
the instruction bus. The problem can be categorized into two types: horizontal scheduling 
and vertical scheduling. For the case of horizontal scheduling, we propose a bipartite- 
matching scheme for instruction scheduling. We prove that our greedy bipartite-matching 
scheme always gives the optimal switching activities ... 

Keywords: Compilers, VLIW instruction scheduling, instruction bus optimizations, low- 
Full text available: pdf(549.12 KB) Additional Information: full citation, abstract, references, index terms

The modification or tuning of the micro-code in a computer that utilizes a writable control 
store is one method whereby a program's execution time can be improved. A method for 
automatically performing a microcode tuning or synthesis has been developed by Drs.
Karlgaard and Abd-Alla and is discussed in detail in [1]. Presented is an extension of this effort
which allows the microcode tuning to be performed on-line on program loops. This is 
accomplished by gathering data on characteristics ... 

22 MIDL - a microinstruction description language 
Marleen Sint 

December 1981 Proceedings of the 14th annual workshop on Microprogramming 
Full text available: pdf(848.63 KB)



A microinstruction description language called MIDL is introduced. A MIDL description of a 
microarchitecture defines the semantics and triggering conditions of all microoperations. It 
also defines operand selection. MIDL incorporates a timing model that allows detailed 
specification of the timing of each microoperation, and a sequencing model that allows the 
description of many different sequencing schemes. 



23 A message passing coprocessor for distributed memory multicomputers
Jiun-Ming Hsu, Prithviraj Banerjee 

November 1990 Proceedings of the 1990 ACM/IEEE conference on Supercomputing 

Full text available: pdf(1.25 MB) Additional Information: full citation, abstract, references

This paper presents the architecture, methodology and performance evaluation of a 
message passing coprocessor (MPC) which can accelerate message communication in a 
distributed memory multicomputer. The MPC is a microprogrammable processor which off- 
loads the CPU of the burden of communication and speeds up the software processing by 
directly executing message passing instructions in microcode. It supports process 
scheduling, message buffer management, and fast buffer copying. The most uni ... 



24 A functional level simulation engine of MAN-YO: a special purpose parallel machine for
logic design automation
T. Nakata, N. Koike 

June 1986 ACM SIGARCH Computer Architecture News, Proceedings of the 13th
annual international symposium on Computer architecture, volume 14 issue 2
Full text available: pdf(511.58 KB) Additional Information: full citation, abstract, references, index terms

The architecture of a proto-type functional level simulator element of a massively parallel 
machine (MAN-YO) designed for logic design automation is presented. At functional level, 
hardware systems are described in a hardware description language, FDL. The FDL 
description is compiled into stack oriented intermediate language instructions. 
Communicating with other gate level/block level/ functional level processors, each functional 
simulator interprets the compiled instructions and simulates ... 

25 A DISE implementation of dynamic code decompression
Marc L. Corliss, E. Christopher Lewis, Amir Roth 

June 2003 ACM SIGPLAN Notices, Proceedings of the 2003 ACM SIGPLAN conference
on Language, compiler, and tool for embedded systems, volume 38 issue 7
Full text available: pdf(291.52 KB) Additional Information: full citation, abstract, references, index terms

Code compression coupled with dynamic decompression is an important technique for both 
embedded and general-purpose microprocessors. Post-fetch decompression, in which 
decompression is performed after the compressed instructions have been fetched, allows the 
instruction cache to store compressed code but requires a highly efficient decompression 
implementation. We propose implementing post-fetch decompression using dynamic 
instruction stream editing (DISE), a programmable decoder— ... 

Keywords: DISE, code compression, code decompression 



26 Exploiting parallel microprocessor microarchitectures with a compiler code generator 
W. W. Hwu, P. P. Chang 

May 1988 ACM SIGARCH Computer Architecture News, Proceedings of the 15th
Annual International Symposium on Computer architecture, volume 16 issue 2
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With advances in VLSI technology, microprocessor designers can provide more 
microarchitectural parallelism to increase performance. We have identified four major forms 
of such parallelism: multiple microoperations issued per cycle, multiple result distribution 
buses, multiple execution units, and pipelined execution units. The experiments reported in 
this paper address two important issues: The effects of these forms and the appropriate 
balance among them. A central microar ... 
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David A. Patterson 
Full text available: pdf(4.06 MB) Additional Information: full citation, abstract, references, citings, index terms



Reduced instruction set computers aim for both simplicity in hardware and synergy between 
architectures and compilers. Optimizing compilers are used to compile programming 
languages down to instructions that are as unencumbered as microinstructions in a large 
virtual address space, and to make the instruction cycle time as fast as possible. 
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John R. Hayes, Martin E. Fraeman, Robert L Williams, Thomas Zaremba 
October 1987 Proceedings of the second international conference on Architectural
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29 Using the Alfa-1 simulated processor for educational purposes 
Gabriel A. Wainer, Sergio Daicz, Luis F. De Simoni, Demian Wassermann 
December 2001 Journal on Educational Resources in Computing (JERIC), volume 1 issue 4 

Full text available: pdf(238.65 KB) Additional Information: full citation, abstract, references, index terms

Alfa-1 is a simulated computer designed for computer organization courses. Alfa-1 and its 
accompanying toolkit allow students to acquire practical insights into developing hardware 
by extending existing components. The DEVS formalism is used to model individual 
components and to integrate them into a hierarchy that describes the detailed behavior of 
different levels of a computer's architecture. We introduce Alfa-1 and the toolkit, show how
to extend existing components, and describe how ... 

Keywords: DEVS formalism, modeling computer architectures, systems specification 



30 The micro-architecture of the ECLIPSE© MV/8000: Conception and implementation 
Jonathan S. Blau, Charles J. Holland, David L. Keating 

November 1980 Proceedings of the 13th annual workshop on Microprogramming 

Full text available: pdf(622.95 KB) Additional Information: full citation, abstract, index terms

The microcode of the ECLIPSE MV/8000 controls the hardware to emulate an instruction set. 
In the MV/8000 the micro-architecture is defined and limited by the following constraints: 1) 
the desire to implement microcode in a limited number of locations; 2) the use of LSI 
technology; 3) a virtual memory architecture. This paper will attempt to show how each of 
these factors contributed to the micro-architecture, to describe that architecture, and to 
relat ... 



31 Evaluation of a high performance code compression method
Charles Lefurgy, Eva Piccininni, Trevor Mudge 

November 1999 Proceedings of the 32nd annual ACM/IEEE international symposium on 
Microarchitecture 

Compressing the instructions of an embedded program is important for cost-sensitive low- 
power control-oriented embedded computing. A number of compression schemes have been 
proposed to reduce program size. However, the increased instruction density has an 
accompanying performance cost because the instructions must be decompressed before 
execution. In this paper, we investigate the performance penalty of a hardware-managed 
code compression algorithm recently introduced in IBM's PowerPC 405. ... 



32 Performance effects of architectural complexity in the Intel 432 
Robert P. Colwell, Edward F. Gehringer, E. Douglas Jensen 

August 1988 ACM Transactions on Computer Systems (TOCS), volume 6 issue 3 

Full text available: pdf(3.45 MB) Additional Information: full citation, abstract, references, citings, index terms
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The Intel 432 is noteworthy as an architecture incorporating a large amount of functionality 
that most other systems perform by software. It has, in effect, "migrated" this functionality 
from the software into the microcode and hardware. The benefits of functional migration 
have recently been a subject of intense controversy, with critics claiming that a complex 
architecture is inherently less efficient than a simple architecture with good software 
support. This paper examines t ... 



33 A fill-unit approach to multiple instruction issue
Manoj Franklin, Mark Smotherman 

November 1994 Proceedings of the 27th annual international symposium on 
Microarchitecture 
Multiple issue of instructions occurs in superscalar and VLIW machines. This paper
investigates a third type of machine design, which combines the advantages of code
compatibility as in superscalars and the absence of complex dependency-checking logic from
the decoder as in VLIW. In this design, a stream of scalar instructions is executed by the
hardware and is simultaneously compacted into VLIW-type instructions, which are then
stored in a structure called a shadow cache. When a shadow cac ...

Keywords: VLIW, instruction-level parallelism, multiple operation issue, superscalar



34 Design considerations for the VLSI processor of X-TREE 
David A. Patterson, E. Scott Fehr, Carlo H. Sequin 
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X-NODE is a single-chip VLSI processor to be realized in the mid 1980's and to be used as a 
building block for a tree-structured multiprocessor system (X-TREE). Three major trends 
influence the design of this processor: the continuing evolution of VLSI technology, the 
requirements for parallelism and communication in a multiprocessor system, and the need 
for better support of software and high level language constructs. The influence of these 
trends on the processor architecture are discuss ... 



35 Multiple instruction issue in the NonStop cyclone processor 
Robert W. Horst, Richard L. Harris, Robert L. Jardine 

May 1990 ACM SIGARCH Computer Architecture News, Proceedings of the 17th
annual international symposium on Computer Architecture, volume 18 issue 3
Full text available: pdf(1.06 MB) Additional Information: full citation, abstract, references, index terms

This paper describes the architecture for issuing multiple instructions per clock in the 
NonStop Cyclone Processor. Pairs of instructions are fetched and decoded by a dual two- 
stage prefetch pipeline and passed to a dual six-stage pipeline for execution. Dynamic 
branch prediction is used to reduce branch penalties. A unique microcode routine for each 
pair is stored in the large duplexed control store. The microcode controls parallel data paths 
optimized for executing the most frequent instr ... 



36 Difficult-path branch prediction using subordinate microthreads 
Robert S. Chappell, Francis Tseng, Adi Yoaz, Yale N. Patt 

May 2002 ACM SIGARCH Computer Architecture News, Proceedings of the 29th
annual international symposium on Computer architecture, volume 30 issue 2
Full text available: pdf(1.14 MB) Additional Information: full citation, abstract, references

Branch misprediction penalties continue to increase as microprocessor cores become wider 
and deeper. Thus, improving branch prediction accuracy remains an important challenge. 
Simultaneous Subordinate Microthreading (SSMT) provides a means to improve branch 
prediction accuracy. SSMT machines run multiple, concurrent microthreads in support of the 
primary thread. We propose to dynamically construct microthreads that can speculatively 
and accurately pre-compute branch outcomes along frequently mis ... 
L. Peter Deutsch

August 1980 Proceedings of the 1980 ACM conference on LISP and functional 
programming 
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This paper describes in detail the most interesting aspects of ByteLisp, a transportable Lisp 
system architecture which implements the Interlisp dialect of Lisp, and its first
implementation, on a microprogrammed minicomputer called the Alto. Two forthcoming
related papers will deal with general questions of Lisp machine and system architecture, and 
detailed measurements of the Alto ByteLisp system described here. A highly condensed 
summary of the series was published at MICRO-11 in Novembe ... 

38 PSCP: a scalable parallel ASIP architecture for reactive systems 
A. Pyttel, A. Sedlmeier, C. Veith 

February 1998 Proceedings of the conference on Design, automation and test in Europe 
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We describe a Codesign approach based on a parallel and scalable ASIP architecture, which 
is suitable for the implementation of reactive systems. The specification language of our 
approach is extended statecharts. Our ASIP architecture is scalable with respect to the 
number of processing elements as well as parameters such as bus widths and register file 
sizes. Instruction sets are generated from a library of components covering a spectrum of 
space/time trade-off alternatives. Our approach featu ... 

Keywords: FPGA, application-specific, statechart, modular 
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LOOM (Large Object-Oriented Memory) is a virtual memory implemented in software that 
supports the Smalltalk-80(™) programming language and environment on the Xerox Dorado 
computer. LOOM provides 8 billion bytes of secondary memory address space and is 
specifically designed to run on computers with a narrow word size (16-bit wide words). All 
storage is viewed as objects that contain fields. Objects may have an average size as small 
as 10 fields. LOOM swaps objects between primary and s ... 
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June 1987 Proceedings of the 14th annual international symposium on Computer 
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A new method of implementing branch instructions is presented. This technique has been 
implemented in the CRISP Microprocessor. With a combination of hardware and software 
techniques the execution time cost for many branches can be effectively reduced to zero. 
Branches are folded into other instructions, making their execution as separate instructions 
unnecessary. Branch Folding can reduce the apparent number of instructions needed to 
execute a program by the number of bran ... 
41 An environment for research in microprogramming and emulation
Robert F. Rosin, Gideon Frieder, Richard H. Eckhouse 
August 1972 Communications of the ACM, volume 15 issue 8

Full text available: pdf(1.33 MB) Additional Information: full citation, abstract, references, citings

The development of the research project in microprogramming and emulation at State 
University of New York at Buffalo consisted of three phases: the evaluation of various 
possible machines to support this research; the decision to purchase one such machine, 
which appears to be superior to the others considered; and the organization and definition 
of goals for each group in the project. Each of these phases is reported, with emphasis 
placed on the early results achieved in this research. 

Keywords: computer systems, emulation, hardware evaluation, input-output systems, 
language processors, microprogramming, nanoprogram, project management 
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42 Phase coupling and constant generation in an optimizing microcode compiler 
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The designer of an optimizing compiler must concern himself with the order in which 
optimization phases are performed; a pair of phases may be interdependent in the sense 
that each phase could benefit from information produced by the other. In a compiler for a 
horizontal target architecture, one such phase-ordering problem occurs between code- 
generation and compaction. Presented here is an overview of a research effort at Carnegie- 
Mellon University which ha ... 

43 Performance of Lisp systems 
Richard P. Gabriel, Larry M. Masinter 

August 1982 Proceedings of the 1982 ACM symposium on LISP and functional 
programming 
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This paper describes the issues involved in evaluating the performance of Lisp systems. We 
explore the various levels at which quantitative statements can be made about the 
performance of a Lisp system, giving examples from existing implementations wherever 
possible. Our thesis is that benchmarking is most effective when performed in conjunction 
with an analysis of the underlying Lisp implementation and computer architecture. We 
examine some simple benchmarks which have been used to measure ... 



44 Session 3: Energy-aware OS's: The benefits of event-driven energy accounting in
power-sensitive systems
Frank Bellosa 

September 2000 Proceedings of the 9th workshop on ACM SIGOPS European workshop: 
beyond the PC: new challenges for the operating system 

Full text available: pdf(86.80 KB) Additional Information: full citation, abstract, references, citings

A prerequisite of energy-aware scheduling is precise knowledge of any activity inside the 
computer system. Embedded hardware monitors (e.g., processor performance counters) 
have proved to offer valuable information in the field of performance analysis. The same 
approach can be applied to investigate the energy usage patterns of individual threads. We 
use information about active hardware units (e.g., integer/floating-point unit, 
cache/memory interface) gathered by event counters to establish at... 



45 Performance of the VAX-11/780 translation buffer: simulation and measurement
Douglas W. Clark, Joel S. Emer 
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A virtual-address translation buffer (TB) is a hardware cache of recently used virtual-to- 
physical address mappings. The authors present the results of a set of measurements and 
simulations of translation buffer performance in the VAX-11/780. Two different hardware 
monitors were attached to VAX-11/780 computers, and translation buffer behavior was 
measured. Measurements were made under normal time-sharing use and while running 
reproducible synthetic time-sharing work loads. Reported measure ... 



46 Code generation and scheduling: Compiler optimization on instruction scheduling for
low power

Chingren Lee, Jenq Kuen Lee, TingTing Hwang, Shi-Chun Tsai 

September 2000 Proceedings of the 13th international symposium on System synthesis 

Full text available: pdf(375.33 KB) Additional Information: full citation, abstract

In this paper, we investigate the compiler transformation techniques to the problem of 
scheduling VLIW instructions aimed to reduce the power consumption on the instruction bus. 
It can be categorized into two types: horizontal and vertical scheduling. For the horizontal 
case, we propose a bipartite-matching scheme. We prove that our greedy algorithm always 
gives the optimal switching activities of the instruction bus. In the vertical case, we prove 
that the problem is NP-hard, and propose a heur ... 
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Charles P. Thacker, Lawrence C. Stewart 

October 1987 Proceedings of the second international conference on Architectural
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48 The hardware architecture of the CRISP microprocessor 
D. R. Ditzel, H. R. McLellan, A. D. Berenbaum 

June 1987 Proceedings of the 14th annual international symposium on Computer 
architecture 
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49 A trace-driven simulation methodology 



Humayun Khalid

December 1995 ACM SIGARCH Computer Architecture News, volume 23 issue 5
Full text available: pdf(573.30 KB) Additional Information: full citation, abstract, index terms

This paper presents a simulation methodology for evaluating the performance of CISC 
computers. The method is called Message Flow Technique (MFT). MFT has several 
advantages over Instruction Flow Technique (IFT) we presented in [1]. The proposed
methodology is applied to a single and two-level cache CISC system using 80486 SX as a 
case study. It was found that with a single-level on-chip cache of size 8K, the performance 
of the system is considerably limited by the service time of BIU(Bus In ... 

Keywords: gibson mix, modified, multi-level cache, performance evaluation, simulation 
methodologies 



50 ATUM: a new technique for capturing address traces using microcode 
A. Agarwal, R. L. Sites, M. Horowitz 
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Trace-driven simulation is often used in the design of computer systems, especially caches 
and translation lookaside buffers. Capturing address traces to drive such simulations has 
been problematic, often involving 1000:1 software overhead to trace a target workload, 
and/or mechanisms that cause significant distortions in the recorded data. A new technique 
for capturing address traces has been developed to use a processor's microcode to record 
addresses in a reserved part of main memory as ... 

51 Executing compressed programs on an embedded RISC architecture
Andrew Wolfe, Alex Chanin 

December 1992 ACM SIGMICRO Newsletter, Proceedings of the 25th annual
international symposium on Microarchitecture, volume 23 issue 1-2
52 Massive arrays of idle disks for storage archives
Dennis Colarelli, Dirk Grunwald 

November 2002 Proceedings of the 2002 ACM/IEEE conference on Supercomputing 


The declining cost of commodity disk drives is rapidly changing the economics of deploying 
large amounts of online or near-line storage. Conventional mass storage systems use either 
high performance RAID clusters, automated tape libraries or a combination of tape and disk. 
In this paper, we analyze an alternative design using massive arrays of idle disks, or MAID. 
We argue that this storage organization provides storage densities matching or exceeding 
those of tape libraries with perform ... 

53 Design tradeoffs to support the C programming language in the CRISP microprocessor 
David R. Ditzel, Hubert R. McLellan, Alan D. Berenbaum 

October 1987 Proceedings of the second international conference on Architectural 
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This paper describes the architecture and the current implementation of the hardware 
unification unit (HUU). The HUU performs the literal unification operation in Prolog 
processing. It is designed as a coprocessor to a host system that handles other operations of 
Prolog processing such as bookkeeping and sequencing. After the host system provides 
input values to the HUU and activates it, the HUU works independently from the host 
system; when it finishes its operation it reports the result t ... 



55 Mobility: PATHS: analysis of PATH duration statistics and their impact on reactive 
MANET routing protocols 

Narayanan Sadagopan, Fan Bai, Bhaskar Krishnamachari, Ahmed Helmy 
June 2003 Proceedings of the fourth ACM international symposium on Mobile ad hoc 
networking & computing 


We develop a detailed approach to study how mobility impacts the performance of reactive 
MANET routing protocols. In particular, we examine how the statistics of path durations, 
including PDFs, vary with parameters such as the mobility model, relative speed, number 
of hops, and radio range. We find that at low speeds, certain mobility models may induce 
multi-modal distributions that reflect the characteristics of the spatial map, mobility 
constraints and the communicating traffic pattern. Howev ... 
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Intergraph's CLIPPER microprocessor is a high performance, three chip module that 
implements a new instruction set architecture designed for convenient programmability, 
broad functionality, and easy future expansion. 
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December 1987 Proceedings of the 20th annual workshop on Microprogramming 
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The approach to speed up a Lisp interpreter by implementing it in firmware seems 
promising. A microcoded Lisp interpreter shows good performance for very simple 
benchmarks, while it often fails to provide good performance for larger benchmarks and 
applications unless speedup techniques are devised for it. This was the case for the 
TAO/ELIS system. This paper describes various techniques devised for the TAO/ELIS system 
in order to speed up the interpreter of the TAO language implemented on t ... 
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A machine independent low level language YALLL is presented. This language produces 
microcode for two very different machines: Hewlett Packard HP 300 and Digital Equipment 
Corporation VAX 11/780. The efficiency of this language is tested by comparing two 
examples on both machines to microassembly coded versions. To our best knowledge, this 
is the first time programs have been compiled and executed on two different 
microarchitectures. These examples also let us compare the efficiency of the ... 
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The HPS Microarchitecture has been developed as an execution model for implementing 
various architectures at very high performance. A considerable amount of effort has gone 
into the use of HPS as a microarchitecture for the VAX. In this paper, we describe our first 
full simulation of the microVAX subset, and report the results of varying (i.e. tuning) certain 
important parameters. 
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Sheetalkumar Doshi, Shweta Bhandare, Timothy X Brown 

June 2002 ACM SIGMOBILE Mobile Computing and Communications Review, volume 6 
Issue 3 


A minimum energy routing protocol reduces the energy consumption of the nodes in a 
wireless ad hoc network by routing packets on routes that consume the minimum amount of 
energy to get the packets to their destination. This paper identifies the necessary features of 
an on-demand minimum energy routing protocol and suggests mechanisms for their 
implementation. We highlight the importance of efficient caching techniques to store the 
minimum energy route information and propose the use of an ... 

3 Selective cache ways: on-demand cache resource allocation 
David H. Albonesi 

November 1999 Proceedings of the 32nd annual ACM/IEEE international symposium on 
Microarchitecture 
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Increasing levels of microprocessor power dissipation call for new approaches at the 
architectural level that save energy by better matching of on-chip resources to application 
requirements. Selective cache ways provides the ability to disable a subset of the ways in a 
set associative cache during periods of modest cache activity, while the full cache may 
remain operational for more cache-intensive periods. Because this approach leverages the 
subarray partitioning that is already pr ... 
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Annual International Symposium on Computer architecture, volume 16 issue 2 
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The design of high-performance multiprocessor systems necessitates a careful analysis of 
the memory system performance of parallel programs. Lacking multiprocessor address 
traces, previous multiprocessor performance studies using analytical models had to make an 
inordinate number of assumptions about the underlying memory reference patterns. We 
previously developed a scheme called ATUM - Address Tracing Using Microcode - to get 
reliable operating system and multiprogramming traces on single ... 
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This paper presents the design and implementation of a high-performance special-purpose 
processor, called The White Dwarf, for accelerating finite element analysis algorithms. The 
White Dwarf CPU contains two Am29325 32-bit floating-point processors and one Am29332 
32-bit ALU, and employs a wide-instruction word architecture in which the application 
algorithm is directly implemented in microcode. The entire system is VME-bus compatible 
and interfaces with a SUN 3/160 host. The syste ... 

A message passing coprocessor for distributed memory multicomputers 
Jiun-Ming Hsu, Prithviraj Banerjee 

November 1990 Proceedings of the 1990 ACM/IEEE conference on Supercomputing 


This paper presents the architecture, methodology and performance evaluation of a 
message passing coprocessor (MPC) which can accelerate message communication in a 
distributed memory multicomputer. The MPC is a microprogrammable processor which 
off-loads the CPU of the burden of communication and speeds up the software processing by 
directly executing message passing instructions in microcode. It supports process 
scheduling, message buffer management, and fast buffer copying. The most uni ... 

Performance of the VAX-11/780 translation buffer: simulation and measurement 
Douglas W. Clark, Joel S. Emer 

February 1985 ACM Transactions on Computer Systems (TOCS), volume 3 issue 1 
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A virtual-address translation buffer (TB) is a hardware cache of recently used virtual-to- 
physical address mappings. The authors present the results of a set of measurements and 
simulations of translation buffer performance in the VAX-11/780. Two different hardware 
monitors were attached to VAX-11/780 computers, and translation buffer behavior was 
measured. Measurements were made under normal time-sharing use and while running 
reproducible synthetic time-sharing work loads. Reported measure ... 
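
The translation buffer described in this abstract can be illustrated with a minimal sketch. It is not the organization measured in the paper: the fully associative layout, the LRU replacement, the 512-byte page size, and the names TranslationBuffer and page_table are all assumptions chosen for illustration.

```python
from collections import OrderedDict

PAGE_SIZE = 512  # bytes; assumed here for illustration


class TranslationBuffer:
    """Hypothetical fully associative, LRU cache of virtual-to-physical page mappings."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table   # virtual page number -> physical frame number
        self.entries = OrderedDict()   # cached mappings, least recently used first
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)         # now the most recently used entry
        else:
            self.misses += 1
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict the least recently used entry
            self.entries[vpn] = self.page_table[vpn]
        return self.entries[vpn] * PAGE_SIZE + offset


# Made-up page table and address trace, purely for demonstration.
page_table = {vpn: vpn + 100 for vpn in range(8)}
tb = TranslationBuffer(capacity=2, page_table=page_table)
trace = [0, 4, 512, 0, 1024, 516]
physical = [tb.translate(vaddr) for vaddr in trace]
print(tb.hits, tb.misses)
```

Feeding a recorded address trace through translate and comparing the hit and miss counters across buffer sizes is the simulation half of the approach; hardware measurement would replace the trace with counts taken from a running machine.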

10 A DISE implementation of dynamic code decompression 
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Code compression coupled with dynamic decompression is an important technique for both 
embedded and general-purpose microprocessors. Post-fetch decompression, in which 
decompression is performed after the compressed instructions have been fetched, allows the 
instruction cache to store compressed code but requires a highly efficient decompression 
implementation. We propose implementing post-fetch decompression using dynamic 
instruction stream editing (DISE), a programmable decoder- ... 

Keywords: DISE, code compression, code decompression 
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This paper proposes a micro-architectural technique in which a prediction is made for some 
power-hungry units of a processor. The prediction consists of whether the result of a 
particular unit or block of logic will be useful in order to execute the current instruction. If it 
is predicted useless, then that block is disabled. It would be ideal if the predictions were 
totally accurate, thus not decreasing the instruction-per-cycle (IPC) performance metric. ... 

12 Testing and Debugging Custom Integrated Circuits 
Edward H. Frank, Robert F. Sproull 

December 1981 ACM Computing Surveys (CSUR), volume 13 issue 4 
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13 Using the Alfa-1 simulated processor for educational purposes 
Gabriel A. Wainer, Sergio Daicz, Luis F. De Simoni, Demian Wassermann 
December 2001 Journal on Educational Resources in Computing (JERIC), volume 1 issue 4 
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Alfa-1 is a simulated computer designed for computer organization courses. Alfa-1 and its 
accompanying toolkit allow students to acquire practical insights into developing hardware 
by extending existing components. The DEVS formalism is used to model individual 
components and to integrate them into a hierarchy that describes the detailed behavior of 
different levels of a computer's architecture. We introduce Alfa-1 and the toolkit, show how 
to extend existing components, and describe how ... 

Keywords: DEVS formalism, modeling computer architectures, systems specification 
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The Intel 432 is noteworthy as an architecture incorporating a large amount of functionality 
that most other systems perform by software. It has, in effect, "migrated" this functionality 
from the software into the microcode and hardware. The benefits of functional migration 
have recently been a subject of intense controversy, with critics claiming that a complex 
architecture is inherently less efficient than a simple architecture with good software 
support. This paper examines t ... 



15 Difficult-path branch prediction using subordinate microthreads 
Robert S. Chappell, Francis Tseng, Adi Yoaz, Yale N. Patt 

May 2002 ACM SIGARCH Computer Architecture News, Proceedings of the 29th annual 
international symposium on Computer architecture, volume 30 issue 2 
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Branch misprediction penalties continue to increase as microprocessor cores become wider 
and deeper. Thus, improving branch prediction accuracy remains an important challenge. 
Simultaneous Subordinate Microthreading (SSMT) provides a means to improve branch 
prediction accuracy. SSMT machines run multiple, concurrent microthreads in support of the 
primary thread. We propose to dynamically construct microthreads that can speculatively 
and accurately pre-compute branch outcomes along frequently mis ... 
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This paper describes the issues involved in evaluating the performance of Lisp systems. We 
explore the various levels at which quantitative statements can be made about the 
performance of a Lisp system, giving examples from existing implementations wherever 
possible. Our thesis is that benchmarking is most effective when performed in conjunction 
with an analysis of the underlying Lisp implementation and computer architecture. We 
examine some simple benchmarks which have been used to measure ... 

19 Design of a user-microprogrammable building block 

Michael Kraley, Randall Rettberg, Philip Herman, Robert Bressler, Anthony Lake 
November 1980 Proceedings of the 13th annual workshop on Microprogramming 
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A user-microprogrammable computer has been developed for use as a building block in 
general-purpose and dedicated computer systems. The architecture is designed to be easily 
microprogrammed and features a 32-bit, vertically oriented microinstruction. The processor 
has a 135-nanosecond cycle time, either 16- or 20-bit macro data paths, and 1024 
hardware registers. A significant fraction of the processor bandwidth may be budgeted for 
I/O processing to allow the substitution of microcode for e ... 

20 Architecture of SOAR: Smalltalk on a RISC 

David Ungar, Ricki Blau, Peter Foley, Dain Samples, David Patterson 

January 1984 ACM SIGARCH Computer Architecture News, Proceedings of the 11th annual 
international symposium on Computer architecture, volume 12 issue 3 
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Smalltalk on a RISC (SOAR) is a simple, Von Neumann computer that is designed to execute 
the Smalltalk-80 system much faster than existing VLSI microcomputers. The Smalltalk-80 
system is a highly productive programming environment but poses tough challenges for 
implementors: dynamic data typing, a high level instruction set, frequent and expensive 
procedure calls, and object-oriented storage management. SOAR compiles programs to a 
low level, efficient instruction set. Parallel tag checks pe ... 
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* ABSTRACT 

Increasing levels of microprocessor power dissipation call for new approaches at the architectural 
level that save energy by better matching of on-chip resources to application requirements. Selective 
cache ways provides the ability to disable a subset of the ways in a set associative cache during 
periods of modest cache activity, while the full cache may remain operational for more cache- 
intensive periods. Because this approach leverages the subarray partitioning that is already present 
for performance reasons, only minor changes to a conventional cache are required, and therefore, 
full-speed cache operation can be maintained. Furthermore, the tradeoff between performance and 
energy is flexible, and can be dynamically tailored to meet changing application and machine 
environmental conditions. We show that trading off a small performance degradation for energy 
savings can produce a significant reduction in cache energy dissipation using this approach. 
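
The performance side of this tradeoff can be sketched with a small model. This is an illustration under stated assumptions, not the paper's design: the class SelectiveWayCache, the helper miss_count, the LRU policy, and the choice to simply drop blocks held in disabled ways are all hypothetical, and energy is not modeled at all.

```python
class SelectiveWayCache:
    """Hypothetical set-associative cache where only `enabled_ways` ways per set are usable."""

    def __init__(self, num_sets, num_ways, block_size):
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.block_size = block_size
        self.enabled_ways = num_ways
        # Each set is a list of tags ordered from least to most recently used.
        self.sets = [[] for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def set_enabled_ways(self, count):
        if not 1 <= count <= self.num_ways:
            raise ValueError("enabled ways must be between 1 and num_ways")
        self.enabled_ways = count
        # Simplifying assumption: blocks in ways that are switched off are dropped.
        for lines in self.sets:
            del lines[:-count]  # keep only the `count` most recently used tags

    def access(self, address):
        block = address // self.block_size
        index = block % self.num_sets
        tag = block // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            self.hits += 1
            lines.remove(tag)
        else:
            self.misses += 1
            if len(lines) >= self.enabled_ways:
                lines.pop(0)    # evict the least recently used tag
        lines.append(tag)       # the accessed tag is now most recently used


def miss_count(enabled_ways, trace):
    """Replay a trace on a one-set, four-way cache with some ways disabled."""
    cache = SelectiveWayCache(num_sets=1, num_ways=4, block_size=16)
    cache.set_enabled_ways(enabled_ways)
    for address in trace:
        cache.access(address)
    return cache.misses


trace = [0, 16, 32, 48] * 3  # four blocks that all map to the same set
print(miss_count(4, trace), miss_count(2, trace))
```

A workload whose working set fits in the enabled ways sees no extra misses, so the disabled ways are pure savings; the cyclic trace above is the opposite extreme, where halving the ways makes every access miss.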
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Specialization has been recognized as a powerful technique for optimizing operating 
systems. However, specialization has not been broadly applied beyond the research 
community because current techniques, based on manual specialization, are time-consuming 
and error-prone. The goal of the work described in this paper is to help operating system 
tuners perform specialization more easily. We have built a specialization toolkit that assists 
the major tasks of specializing operating systems. We de ... 
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Andrew is a distributed computing environment that is a synthesis of the personal 
computing and timesharing paradigms. When mature, it is expected to encompass over 
5,000 workstations spanning the Carnegie Mellon University campus. This paper examines 
the security issues that arise in such an environment and describes the mechanisms that 
have been developed to address them. These mechanisms include the logical and physical 
separation of servers and clients, support for secure communication ... 
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File caching is essential to good performance in a distributed system, especially as processor 
speeds and memory sizes continue to improve rapidly while disk latencies do not. Stateless- 
server systems, such as NFS, cannot properly manage client file caches. Stateful systems, 
such as Sprite, can use explicit cache consistency protocols to improve both cache 
consistency and overall performance. By modifying NFS to use the Sprite cache consistency 
protocols, we isolate the effects o ... 
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The demand for high-performance architectures and powerful battery-operated mobile 
devices has accentuated the need for low-power systems. In many media and embedded 
applications, the memory system can consume more than 50% of the overall system 
energy, making it a ripe candidate for optimization. To address this increasingly important 
problem, this article studies energy-efficient cache architectures in the memory hierarchy 
that can have a significant impact on the overall system energy ... 
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The caching behavior of multimedia applications has been described as having high 
instruction reference locality within small loops, very large working sets, and poor data 
cache performance due to non-locality of data references. Despite this, there is no published 
research deriving or measuring these qualities. Utilizing the previously developed Berkeley 
Multimedia Workload, we present the results of execution driven cache simulations with the 
goal of aiding future media processing architect ... 
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The effective management of caches is critical to the performance of applications on shared- 
memory multiprocessors. In this paper, we discuss a technique for software cache 
coherence that is based upon the integration of a program-level abstraction for shared data 
with software cache management. The program-level abstraction, called Shared Regions, 
explicitly relates synchronization objects with the data they protect. Cache coherence 
algorithms are presented which use the i ... 
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One of the fundamental limits to high-performance, high-reliability file systems is memory's 
vulnerability to system crashes. Because memory is viewed as unsafe, systems periodically 
write data back to disk. The extra disk traffic lowers performance, and the delay period 
before data is safe lowers reliability. The goal of the Rio (RAM I/O) file cache is to make 
ordinary main memory safe for persistent storage by enabling memory to survive operating 
system crashes. Reliable memory enables a syste ... 
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To meet the demands of modern architectures, optimizing compilers must incorporate an 
ever larger number of increasingly complex transformation algorithms. Since code 
transformations may often degrade performance or interfere with subsequent 
transformations, compilers employ predictive heuristics to guide optimizations by predicting 
their effects a priori. Unfortunately, the unpredictability of optimization interaction and the 
irregularity of today's wide-issue machines severely limit the accura ... 
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In order to run untrusted code in the same process as trusted code, there must be a 
mechanism to allow dangerous calls to determine if their caller is authorized to exercise the 
privilege of using the dangerous routine. Java systems have adopted a technique called 
stack inspection to address this concern. But its original definition, in terms of searching 
stack frames, had an unclear relationship to the actual achievement of security, 
overconstrained the implementation of a Java system, lim ... 
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The purpose of a distributed file system (DFS) is to allow users of physically distributed 
computers to share data and storage resources by using a common file system. A typical 
configuration for a DFS is a collection of workstations and mainframes connected by a local 
area network (LAN). A DFS is implemented as part of the operating system of each of the 
connected computers. This paper establishes a viewpoint that emphasizes the dispersed 
structure and decentralization of both data and con ... 
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This paper discusses implementations of fine-grain memory access control, which selectively 
restricts reads and writes to cache-block-sized memory regions. Fine-grain access control 
forms the basis of efficient cache-coherent shared memory. This paper focuses on low-cost 
implementations that require little or no additional hardware. These techniques permit 
efficient implementation of shared memory on a wide range of parallel systems, thereby 
providing shared-memory codes with a portability ... 
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Instruction traces are useful tools for studying many aspects of computer systems, but they 
are difficult to gather without perturbing the systems being traced. In the past, researchers 
have collected instruction traces through various techniques, including single-stepping, 
instruction inlining, hardware monitoring, and processor simulation. These approaches, 
however, fail to produce accurate traces because they interfere with the processor's normal 
execution. Because processors are deterministic ... 
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CHAOSarc is an object-based multiprocessor operating system kernel that provides 
primitives with which programmers may easily construct objects of differing types and 
object invocations of differing semantics, targeting multiprocessor systems, and real-time 
applications. The CHAOSarc can guarantee desired performance and functionality levels of 
selected computations in real-time applications. Such guarantees can be made despite 
poss ... 
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We have performed a study of the usage of the Windows NT File System through long-term 
kernel tracing. Our goal was to provide a new data point with respect to the 1985 and 1991 
trace-based File System studies, to investigate the usage details of the Windows NT file 
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system architecture, and to study the overall statistical behavior of the usage data. In this 
paper we report on these issues through a detailed comparison with the older traces, 
through details on the operational characteristics and ... 
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The rapid growth of the World Wide Web has caused serious performance degradation on 
the Internet. This paper offers an end-to-end approach to improving Web performance by 
collectively examining the Web components — clients, proxies, servers, and the network. 
Our goal is to reduce user-perceived latency and the number of TCP connections, improve 
cache coherency and cache replacement, and enable prefetching of resources that are likely 
to be accessed in the near future. In our scheme, server re ... 
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Speculative thread-level parallelization is a promising way to speed up codes that compilers 
fail to parallelize. While several speculative parallelization schemes have been proposed for 
different machine sizes and types of codes, the results so far show that it is hard to deliver 
scalable speedups. Often, the problem is not true dependence violations, but sub-optimal 
architectural design. Consequently, we attempt to identify and eliminate major architectural 
bottlenecks that limit the scalab ... 
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Object combining tries to put objects together that have roughly the same life times in order 
to reduce strain on the memory manager and to reduce the number of pointer indirections 
during a program's execution. Object combining works by appending the fields of one object 
to another, allowing allocation and freeing of multiple objects with a single heap 
(de)allocation. Unlike object inlining, which will only optimize objects where one has a (unique) 
pointer to another, our optimization al ... 
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